Customer Data Project: Segmentation

September 2022
Fatih Catpinar

The aim of the project is to review and explore the dataset, then use machine learning and visualizations to help the marketing team segment the customers, so that email campaigns can be better targeted to increase sales.

The analysis answers two questions: What type of customer should the marketing team target? What are the characteristics of that customer? The objective is to classify each customer so the marketing team can send the right offers to the right clients.

This project covers the following topics: data exploration and cleaning, customer lifetime value and RFM (recency, frequency, monetary) segmentation and analysis, and K-means clustering.

Additional solutions that the provided data could support to help the marketing team are also discussed.

1. Explore and clean the data

The first step is to load and explore the data. Before running further analyses or building segmentation models, we need to understand the data and make sure the dataset contains no incorrect, corrupted, incorrectly formatted, duplicate, or incomplete records. We will check whether the data has duplicate rows or missing values.

The dataset is sales data with 541909 rows and 8 columns. Data columns are

Each row represents a sale of a specific item. Each InvoiceNo has only one CustomerID. An InvoiceNo can cover multiple items, so one invoice can consist of multiple rows. Each item in a row has a description. Invoices have date, time, and location information.

Handling duplicates

There are 5268 duplicated rows. We need to drop them, since duplicates can inflate the results of the analysis.
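A minimal sketch of the duplicate check with pandas is given here. The toy DataFrame is a stand-in invented for illustration; only the column names `InvoiceNo`, `StockCode`, `Quantity`, and `CustomerID` come from the text above.

```python
import pandas as pd

# Toy stand-in for the sales data (values are made up for illustration).
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "71053", "22633"],
    "Quantity": [6, 6, 6, 6],
    "CustomerID": [17850.0, 17850.0, 17850.0, 17850.0],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and drop the repeats.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicated row(s), {len(df_clean)} rows kept")
```

`duplicated()` compares all columns by default, so two rows of the same invoice that differ only in the item are not treated as duplicates.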

Handling missing values

Missing values can bias machine learning models and reduce their accuracy. We need to handle the missing values to build a healthy segmentation model.

There are 1,454 missing 'Description' values (0.27% of the data) and 135,080 missing 'CustomerID' values (25.16% of the data). Since the analysis is based on customers, we need to remove the rows with a missing 'CustomerID'. This assumes that the orders with no CustomerID did not come from customers already in the dataset.

There was no need to handle the missing 'Description' values separately, since we have the stock code for the purchased items. In any case, removing the rows with a missing 'CustomerID' also removed every row with a missing 'Description'.
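The missing-value handling described above could look roughly like the following sketch. The data is a toy example, so the percentages it prints are not the ones reported in the text.

```python
import numpy as np
import pandas as pd

# Toy data: two rows lack both a description and a customer ID.
df = pd.DataFrame({
    "StockCode": ["85123A", "71053", "22633", "22632"],
    "Description": ["HOLDER", None, "HAND WARMER", None],
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
})

# Share of missing values per column, in percent.
missing_pct = df.isna().mean().mul(100).round(2)

# Keep only rows that can be attributed to a customer.
df_customers = df.dropna(subset=["CustomerID"]).copy()

# Check whether any missing descriptions survive the CustomerID filter.
remaining_missing_desc = int(df_customers["Description"].isna().sum())

print(missing_pct)
print(remaining_missing_desc)
```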

Organizing the dataframe

The data was collected between 2010-12-01 and 2011-12-09.

Generate descriptive statistics

The minimum value for 'Quantity' is negative, which is not possible for a sale. We need to explore the reason for the negative values. If they carry a special meaning that helps us extract more information, we will use it. Otherwise we will remove the rows with negative quantities.

All of the rows with a negative 'Quantity' have an "InvoiceNo" that starts with the letter 'C', which appears to mean that the invoice was cancelled. It is good to have the cancelled transactions in the data, because they allow further analysis. What is the most cancelled item? What is the cancellation rate? Who cancelled the most? Investigating the cancelled orders further could help prevent future cancellations. For the customer value segmentation study, however, the cancelled invoices are removed.
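A short sketch of how the cancelled invoices could be identified and separated is given here, assuming (as the text observes) that cancelled invoice numbers start with 'C'. The rows are invented, and the rate is computed per row rather than per invoice.

```python
import pandas as pd

# Toy data: two regular invoices and two cancellations.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "C536383"],
    "Quantity": [6, -1, 2, -3],
})

# Flag cancelled rows by the leading 'C' in the invoice number.
is_cancelled = df["InvoiceNo"].astype(str).str.startswith("C")

# Verify that every negative quantity belongs to a cancelled invoice.
all_negative_cancelled = bool(is_cancelled[df["Quantity"] < 0].all())

# Percentage of rows that are cancellations.
cancel_rate = is_cancelled.mean() * 100

# Data used for the segmentation study: cancellations removed.
df_sales = df[~is_cancelled].copy()

print(all_negative_cancelled, cancel_rate, len(df_sales))
```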

Check "StockCode" and "Description" features

"StockCode" and "Description" do not have the same number of unique values. This is probably because some StockCode values have multiple Description values.
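One way to confirm this suspicion is to count the distinct descriptions per stock code, as in the sketch below. The stock codes and descriptions are made up for illustration.

```python
import pandas as pd

# Toy data: stock code A1 appears with two different descriptions.
df = pd.DataFrame({
    "StockCode": ["A1", "A1", "B2", "B2", "C3"],
    "Description": ["RED MUG", "MUG RED", "LAMP", "LAMP", "CLOCK"],
})

# Number of distinct descriptions attached to each stock code.
desc_per_code = df.groupby("StockCode")["Description"].nunique()

# Stock codes that map to more than one description.
multi_desc_codes = desc_per_code[desc_per_code > 1]

print(multi_desc_codes)
```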

Check unique customer and item count.

There are 4339 unique customers and there are 3665 unique items.

Explore the location feature

There are 37 countries. Investigate the following information to understand the country effect:

Plot the country factor in the data

As seen in the bar plots, the United Kingdom has the most transactions, the highest total sales revenue, the most customers, and the most units sold. Most of the data comes from the United Kingdom.

The data is unbalanced by country. To keep cultural differences between countries from adding variation to the segments, we will focus on one country at a time. Since the largest share of sales comes from the UK, we focus on it first.

For the rest of the analysis, United Kingdom data is going to be used.
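The per-country comparison and the UK filter can be sketched as follows. The column names `UnitPrice` and `Country` are assumptions, since the column list is not reproduced in this text, and the rows are toy values.

```python
import pandas as pd

# Toy data; "UnitPrice" and "Country" are assumed column names.
df = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3", "4"],
    "CustomerID": [10, 10, 11, 12, 10],
    "Quantity": [2, 1, 5, 3, 4],
    "UnitPrice": [1.5, 2.0, 1.0, 4.0, 0.5],
    "Country": ["United Kingdom", "United Kingdom", "France",
                "United Kingdom", "United Kingdom"],
})
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

# The four quantities compared in the bar plots, aggregated per country.
by_country = df.groupby("Country").agg(
    transactions=("InvoiceNo", "nunique"),
    revenue=("Revenue", "sum"),
    customers=("CustomerID", "nunique"),
    units=("Quantity", "sum"),
).sort_values("revenue", ascending=False)

# Restrict the rest of the analysis to one country.
df_uk = df[df["Country"] == "United Kingdom"].copy()

print(by_country)
```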

What is the most sold item in the UK?

What is the most frequently purchased item in the UK?

Plot the monthly sales in the UK

2. Customer lifetime value and RFM (recency, frequency, monetary) segmentation

After we explore and clean the data, we can now segment the customers in order to better target them for email campaigns to increase sales.

Customer Lifetime Value indicates the total revenue from a customer over the entire relationship. It helps companies focus on the customers who can bring in the most revenue in the future.

To identify the best customers, the most profitable customers, and the lost customers, we will create recency, frequency, and monetary columns for each customer. In our case, recency is the number of days since the customer's last purchase, frequency is the number of invoices for the customer, and monetary is the customer's lifetime total purchase value.
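A minimal sketch of building the RFM table with a pandas `groupby` is given here. The column names `InvoiceDate` and `UnitPrice` and the choice of reference date (one day after the last invoice) are assumptions for illustration, not taken from the original notebook.

```python
import pandas as pd

# Toy transactions; "InvoiceDate" and "UnitPrice" are assumed column names.
df = pd.DataFrame({
    "CustomerID": [10, 10, 10, 11, 12, 12],
    "InvoiceNo": ["1", "1", "2", "3", "4", "5"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-11-01", "2011-12-05",
        "2011-06-15", "2011-09-01", "2011-12-08",
    ]),
    "Quantity": [2, 1, 5, 3, 4, 1],
    "UnitPrice": [1.5, 2.0, 1.0, 4.0, 0.5, 10.0],
})
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

# Assumed reference date: one day after the most recent invoice.
snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = df.groupby("CustomerID").agg(
    # Days between the reference date and the customer's last purchase.
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Number of distinct invoices.
    Frequency=("InvoiceNo", "nunique"),
    # Total amount spent.
    Monetary=("Revenue", "sum"),
)

print(rfm)
```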

RFM Generate descriptive statistics

Plot recency

The following plot shows recency, the number of days since each customer's last purchase.

The histogram above shows that for most customers the last purchase was 100 days ago or less.

Plot frequency

The following plot shows frequency, the number of invoices for each customer.

Plot monetary

The following plot shows the monetary value, each customer's lifetime total purchase value.

RFM Model

We can create customer segments from an RFM model by using quartiles. We will assign a score from 1 to 4 to each category (Recency, Frequency, and Monetary), where 1 is the lowest score and 4 is the best score.
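Quartile scoring can be sketched with `pandas.qcut`, as below. The RFM values are toy numbers. Two choices here are assumptions rather than the original notebook's method: the recency labels are reversed so that recent buyers get the best score, and frequency is ranked before binning to break ties.

```python
import pandas as pd

# Toy RFM table for eight customers.
rfm = pd.DataFrame({
    "Recency": [2, 10, 35, 80, 150, 200, 300, 365],
    "Frequency": [12, 9, 7, 5, 3, 2, 1, 1],
    "Monetary": [5000.0, 2500.0, 1200.0, 800.0, 500.0, 300.0, 150.0, 50.0],
}, index=pd.Index(range(1, 9), name="CustomerID"))

# Low recency (a recent purchase) is good, so the labels run from 4 down to 1.
rfm["R"] = pd.qcut(rfm["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)

# Ranking first breaks ties, so qcut always finds four distinct bin edges.
rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), q=4,
                   labels=[1, 2, 3, 4]).astype(int)

# High spending is good, so the labels run from 1 up to 4.
rfm["M"] = pd.qcut(rfm["Monetary"], q=4, labels=[1, 2, 3, 4]).astype(int)

print(rfm[["R", "F", "M"]])
```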

As seen from the pie chart above, after the quartile scoring the customers are distributed almost evenly across the recency scores. It means the number of customers whose last purchase is

The pie chart above shows the distribution of the customer frequency scores.

The pie chart above shows the distribution of the customer monetary scores.

  1. The customers who spend 298.11 and less
  2. The customers who spend between 298.11 and 644.30
  3. The customers who spend between 644.30 and 1570.81
  4. The customers who spend 1570.81 and up

Combine the RFM Scores

We can concatenate the recency, frequency, and monetary scores so that we have one combined score to segment the customers.

We can define the best customers as those with an RFM_Score of 444, since 4 is the best score in each category.

Distribution of the customers based on RFM scores
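Combining the three scores is a simple string concatenation. The sketch below assumes the per-category scores are stored in columns named `R`, `F`, and `M`, which are hypothetical names. `RFM_Score` is the name used in the text.

```python
import pandas as pd

# Toy quartile scores for three customers.
rfm = pd.DataFrame(
    {"R": [4, 1, 3], "F": [4, 2, 1], "M": [4, 1, 2]},
    index=pd.Index([10, 11, 12], name="CustomerID"),
)

# Join the three digits into one label, e.g. 4, 4, 4 becomes "444".
rfm["RFM_Score"] = (
    rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
)

# Best customers: top score in every category.
best_customers = rfm[rfm["RFM_Score"] == "444"]

print(rfm)
```

Keeping the score as a string preserves the three digits as separate positions, so a segment rule can later test each one individually.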

We can categorize the customers based on their RFM score.

3. Customer clustering with K means

In this section, we will segment the customers into different classes.

Candidate algorithms include Logistic Regression, K-means clustering, and K-nearest Neighbors. We don't have labels, so supervised classifiers such as Logistic Regression and K-nearest Neighbors do not apply, and we need an unsupervised method. We also want a simple, cost-effective algorithm. K-means is simple to implement and scales well, so it can be applied to both small and large datasets.

To find the number of clusters that best represents the data, we will use the elbow method and the silhouette score.

Standardize the data

Since K-means uses distance-based measurements to determine the similarity between data points, it is a good idea to standardize the data to have a mean of zero and a standard deviation of one.
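Standardization can be sketched with scikit-learn's `StandardScaler`. The RFM values are toy numbers. The point of the example is that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy RFM table with columns on very different scales.
rfm = pd.DataFrame({
    "Recency": [3, 40, 120, 300],
    "Frequency": [20, 6, 2, 1],
    "Monetary": [5200.0, 900.0, 310.0, 45.0],
})

# Subtract the column mean and divide by the column standard deviation.
scaled = StandardScaler().fit_transform(rfm)
rfm_scaled = pd.DataFrame(scaled, columns=rfm.columns, index=rfm.index)

print(rfm_scaled.round(3))
```

Without this step, the monetary column would dominate the Euclidean distance simply because its numbers are larger.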

Check Data Skewness

One of the key K-means assumptions is that the variables are symmetrically distributed. Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right.

The data is skewed as seen from the following plots. We need to handle it before applying the machine learning model.

Handle Skewness
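The transform used in the original notebook is not preserved in this text. A log transform is one common way to reduce right skew and is assumed in the sketch below, with toy values and `numpy.log1p` (log of 1 + x, which also tolerates zeros).

```python
import numpy as np
import pandas as pd

# Toy RFM values with a long right tail in every column.
rfm = pd.DataFrame({
    "Recency": [1, 3, 7, 15, 30, 60, 180, 365],
    "Frequency": [1, 1, 2, 2, 3, 5, 12, 60],
    "Monetary": [50.0, 120.0, 300.0, 640.0, 900.0, 1600.0, 5000.0, 40000.0],
})

# Sample skewness per column before the transform.
skew_before = rfm.skew()

# Assumed fix: compress the long tail with log(1 + x).
rfm_log = np.log1p(rfm)

# Sample skewness per column after the transform.
skew_after = rfm_log.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}).round(2))
```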

Find the number of clusters (k) for the k-means model

We will try two methods to define the best k.

The first one is the elbow method. For a range of k values, it fits the model and computes the within-cluster sum of squares, that is, the sum of the squared distances from each point to its cluster center. We then plot this sum of squares against k. The curve drops steeply at first and then flattens out. The bend (the "elbow") where the decrease slows down is taken as the value of k.

The second method is the silhouette score, which measures the degree of separation between clusters. Its value ranges from -1 to 1. A score close to 1 means the clusters are dense and well separated from each other.

Elbow Method to find best k

Using the elbow method to determine the optimal number of clusters for k-means clustering
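A minimal sketch of the elbow computation with scikit-learn is given here. It uses synthetic data with three well-separated groups instead of the customer data, and prints the within-cluster sum of squares (`inertia_`) rather than plotting it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled RFM data: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [6, 6, 6], [-6, 6, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 3)) for c in centers])

# Within-cluster sum of squares for each candidate k.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k gives the elbow curve. On this synthetic data the large drops end at k=3, which is where the bend appears.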

From the above we can see that the curve bends sharply at k=3. We can select k as 3 or 4.

We can also use the KElbowVisualizer from the Yellowbrick package to mark the best k value on the plot below.

Silhouette score method to find the best k
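The silhouette comparison could be sketched as follows with `sklearn.metrics.silhouette_score`. The data is synthetic, with three well-separated groups, so the result illustrates the procedure only. The score is undefined for a single cluster, so k starts at 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM data: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [6, 6, 6], [-6, 6, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 3)) for c in centers])

# Mean silhouette score for each candidate k (needs at least 2 clusters).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the preferred number of clusters.
best_k = max(scores, key=scores.get)

print(best_k, round(scores[best_k], 3))
```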

As seen above, the silhouette score is highest with 3 clusters, so we pick 3 as the best number of clusters.

The best number of clusters is 3, so we will use k=3. We could still use more clusters to segment the data into more classes, but for this project I will use 3.

K-means Model
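A sketch of fitting the final model with k=3 and attaching the cluster label to each customer is given here. The RFM table is synthetic, and the preprocessing (log transform followed by standardization) is an assumption about the pipeline rather than the notebook's confirmed steps.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic RFM table with three loosely defined customer groups.
rng = np.random.default_rng(1)
rfm = pd.DataFrame({
    "Recency": np.concatenate([rng.uniform(1, 20, 30),
                               rng.uniform(30, 90, 30),
                               rng.uniform(200, 365, 30)]),
    "Frequency": np.concatenate([rng.uniform(15, 30, 30),
                                 rng.uniform(3, 8, 30),
                                 rng.uniform(1, 2, 30)]),
    "Monetary": np.concatenate([rng.uniform(3000, 6000, 30),
                                rng.uniform(400, 900, 30),
                                rng.uniform(20, 150, 30)]),
})

# Assumed preprocessing: reduce skew, then standardize.
X = StandardScaler().fit_transform(np.log1p(rfm))

# Fit K-means with the chosen k and store one label per customer.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
rfm["Cluster"] = kmeans.fit_predict(X)

cluster_sizes = rfm["Cluster"].value_counts().sort_index()
print(cluster_sizes)
```

The cluster numbers 0, 1, and 2 are arbitrary labels assigned by the algorithm and carry no ranking of their own.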

As seen above, we have divided the customers into 3 groups, but we still don't know what the groups mean. We can analyze them further to see the characteristics of each cluster. We want to learn which group contains the most frequent buyers, which group contains the top customers, etc. We can also investigate what type of items each cluster buys.

What does each cluster mean?

To answer this question, we can first apply the logic of the RFM model. First we need to merge the data to see which customer categories each cluster contains.
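The merge and comparison could be sketched as below. The segment labels ("top", "regular", "lost") and the cluster assignments are hypothetical placeholders, not the categories or results of the original analysis.

```python
import pandas as pd

# Toy RFM table with a hypothetical RFM-based segment label per customer.
rfm = pd.DataFrame({
    "Recency": [3, 5, 40, 60, 300, 330],
    "Frequency": [20, 15, 4, 3, 1, 1],
    "Monetary": [5000.0, 4200.0, 700.0, 500.0, 90.0, 60.0],
    "Segment": ["top", "top", "regular", "regular", "lost", "lost"],
}, index=pd.Index([10, 11, 12, 13, 14, 15], name="CustomerID"))

# Hypothetical K-means labels for the same customers.
clusters = pd.Series([1, 1, 0, 0, 2, 2], index=rfm.index, name="Cluster")

# Merge the cluster labels into the RFM table on CustomerID.
merged = rfm.join(clusters)

# Average recency, frequency, and monetary value per cluster.
profile = merged.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()

# How the RFM segments are distributed across the clusters.
overlap = pd.crosstab(merged["Cluster"], merged["Segment"])

print(profile)
print(overlap)
```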

We understand that

4. Additional possible solutions with the data

  1. Sales recommendation

With the data we have, we can anticipate what customers are likely to buy and make sales recommendations.

  2. Sales forecast

From the available data we can additionally build a sales forecast.

  3. Customer segmentation based on consumption habits

The customers can be segmented by their consumption habits. Item category information would be helpful here. The customers can be classified based on the category of products, the number of purchases, and the total payment.

Additional data that might help with the proposed solutions:

5. Conclusion

In this project, Customer Lifetime Value, RFM segmentation, and K-means clustering are used to help the marketing team segment the customers in order to better target them for email campaigns to increase sales.

We know that different groups require different marketing approaches and we want to figure out which group can boost the profit the most.

To do that, the customers in the dataset are divided into groups with RFM segmentation based on customer purchase history and customer lifetime value. We were able to categorize the data into 9 groups, including "Top customer need attention" and "customers need attention", so that the marketing team can prioritize their strategy.

With the K-means unsupervised machine learning algorithm, we classified the customers into three clusters. Cluster 0 means "Current Customer", Cluster 1 means "Top Customer", and Cluster 2 means "Lost Customers". The marketing team can use the clustered data to personalize the promotions for each group based on its needs. However, the RFM model gives more information than the K-means algorithm. We can rerun the algorithm with more features and plot the results to understand the customer types more clearly.